covidHubUtils R packageThe COVID-19 Forecast Hub is a central repository for modeler-contributed short-term forecasts of COVID-19 trends in the US. The US Centers for Disease Control and Prevention (CDC) displays forecasts from the Forecast Hub on its modeling and forecasting webpages.
The Forecast Hub has been curating forecast data since April 2020, and has collected over 150 million unique rows of forecast data. These data are stored in our public GitHub repository and in the Zoltar forecast archive.
The goal of the covidHubUtils R package is to create a set of basic utility functions for accessing, visualizing, and scoring forecasts from the COVID-19 Forecast Hub.
The covidHubUtils package relies on a small number of packages, including many from the tidyverse and, importantly, the zoltr package that is used to access the Zoltar API for downloading forecasts. Please install zoltr from GitHub, as this development version often has important features not yet on the CRAN version:
devtools::install_github("reichlab/zoltr")
The covidHubUtils package currently is only available on GitHub, and it may be installed using the devtools package:
devtools::install_github("reichlab/covidHubUtils")
One of the key features of the COVID-19 Forecast Hub is making millions of rows of forecast data available in a standard format for easy analysis and visualization. The covidHubUtils package allows for users to download data into an R session either by reading files from a local clone of the COVID-19 Forecast Hub repository or by downloading data from the Zoltar API. (While Zoltar currently requires a user account to download data via the API, we have created a specific user account for covidHubUtils so that a user account is not needed.)
We have identified two central use cases for downloading data:
load_latest_forecasts() function.load_forecasts() function.Below are some examples of reading in data. We start by loading the covidHubUtils package and the tidyverse.
library(covidHubUtils)
library(tidyverse)
theme_set(theme_bw())
The following code loads 1 through 4 week ahead incident case forecasts from Zoltar for the COVIDhub-ensemble model. Note that the forecast_date_window_size option specifies the range of days to look at for a forecast. So the query below is looking for the most recent forecast from COVIDhub-ensemble in the span of 2020-12-01 through 2020-12-07. We have decided to set forecast_date_window_size to 6 because a window size of 0 still includes the last forecast date you provided. So that last date minus 6 days will cover a week, starting from 6 days before the last date and going up through the last date.
We also load the hospitalization data and incident deaths here by changing the target to the desired quantity and load that dataset to further work on. The target variable can be changed to what is desired by changing the appropriate lines of code.
The verbose parameter has been explicitly set to false here to prevent the multiple lines of output of the inner workings of the load function. It is set to false by default and all the information is printed to the console unless specified otherwise.
# Load forecasts that were submitted in a time window from zoltar
inc_case_targets <- paste(1:4, "wk ahead inc case")
forecasts <- load_latest_forecasts(models = "COVIDhub-ensemble",
last_forecast_date = "2021-03-08",
forecast_date_window_size = 6,
locations = "US",
types = c("point","quantile"),
targets = inc_case_targets,
source = "zoltar",
verbose = FALSE,
as_of=NULL,
hub = c("US"))
hosp_data <- paste(0:130, "day ahead inc hosp")
forecasts_hosp <- load_latest_forecasts(models = "COVIDhub-ensemble",
last_forecast_date = "2021-03-08",
forecast_date_window_size = 6,
locations = "US",
types = c("point","quantile"),
targets = hosp_data,
source = "zoltar",
verbose = FALSE,
as_of=NULL,
hub = c("US"))
inc_case_targets_death <- paste(1:4, "wk ahead inc death")
forecasts_death <- load_latest_forecasts(models = "COVIDhub-ensemble",
last_forecast_date = "2021-03-08",
forecast_date_window_size = 6,
locations = "US",
types = c("point","quantile"),
targets = inc_case_targets_death,
source = "zoltar",
verbose = FALSE,
as_of=NULL,
hub = c("US"))
And here is the data that is returned from this query, note that in addition to the essential forecast data itself, some additional columns with information about the locations are returned.
This data can then be plotted directly with a call to plot_forecasts().
p <- plot_forecasts(forecast_data = forecasts,
truth_source = "JHU",
target_variable = "inc case",
intervals = c(.5, .8, .95))
p_hosp <- plot_forecasts(forecast_data = forecasts_hosp,
truth_source = "HealthData",
target_variable = "inc hosp",
intervals = c(.5, .8, .95))
p_death <- plot_forecasts(forecast_data = forecasts_death,
truth_source = "JHU",
target_variable = "inc death",
intervals = c(.5, .8, .95))
Note that many additional arguments may need to be passed to this function to have a reasonable plot returned if your dataset has multiple locations, forecast dates or models.
Additionally the resulting plot object can be modified. For example, if you want to change the way the x-axis handles dates, you could add a custom ggplot scale_x_date() specification:
p + scale_x_date(name=NULL, date_breaks = "1 month", date_labels = "%b") +
theme(axis.ticks.length.x = unit(0.5, "cm"), axis.text.x = element_text(vjust = 7, hjust = -0.2))
Additionally, plot_forecasts() can handle plotting multiple models or locations or forecast dates at the same time as the following examples show.
CovidHubUtils package also includes functionaliity to load in versioned forecasts. This is available through the as_of parameter. as_of parameter accepts a date in YYYY-MM-DD format to load forecasts submitted as of this date. It defaults to NULL to load the latest version. This parameter is only supported when loading forecasts from Zoltar. It is useful to see the changes implemented in the forecasts.
# Load forecasts that were submitted in a time window from zoltar
as_of_forecasts <- load_latest_forecasts(models = "Columbia_UNC-SurvCon",
last_forecast_date = "2021-01-03",
source = 'zoltar',
as_of = "2021-01-04",
verbose = FALSE,
location = 'US')
correct_forecast <- load_latest_forecasts(models = "Columbia_UNC-SurvCon",
last_forecast_date = "2021-01-03",
source = 'zoltar',
verbose = FALSE,
location = 'US')
datatable(as_of_forecasts,
extensions = 'FixedColumns',
options = list(
dom = 't',
scrollX = TRUE,
fixedColumns = list(leftColumns = 2)
))
datatable(correct_forecast,
extensions = 'FixedColumns',
options = list(
dom = 't',
scrollX = TRUE,
fixedColumns = list(leftColumns = 2)
))
Plotting these 2 different forecasts will show the difference between the forecasts and the versioned forecasts.
p_as_of <- plot_forecasts(forecast_data = as_of_forecasts,
truth_source = "JHU",
target_variable = "inc death",
intervals = c(.5, .8, .95))
p_correct <- plot_forecasts(forecast_data = correct_forecast,
truth_source = "JHU",
target_variable = "inc death",
intervals = c(.5, .8, .95))
The hub parameter has been included to incorporate the models that submit forecasts for European countries. The hub parameter is a character vector, where the first element indicates the hub from which to load forecasts. Possible options are “US” and “ECDC”.
Here is an example for using the European hub (ECDC), iif nothing is provided, it defaults to US.
# Load forecasts that were submitted in a time window from zoltar
inc_case_targets <- paste(1:4, "wk ahead inc case")
forecasts_ECDC <- load_latest_forecasts(models=c("ILM-EKF"),
hub = c("ECDC","US"), last_forecast_date = "2021-03-08",
forecast_date_window_size = 0,
locations = c("GB"),
targets = paste(1:4, "wk ahead inc death"),
source = "zoltar",
verbose=FALSE)
datatable(forecasts_ECDC,
extensions = 'FixedColumns',
options = list(
dom = 't',
scrollX = TRUE,
fixedColumns = list(leftColumns = 2)
))
p_ECDC <- plot_forecasts(forecast_data = forecasts_ECDC,
truth_source = "JHU",
target_variable = "inc death",
intervals = c(.5, .8, .95),
hub = c("ECDC")
)
The following code looks at three models’ forecasts of incident deaths at one time point for one location. Note the use of the fill_by_model option which allows colors to vary by model and the facet command which is passed to ggplot.
fdat <- load_latest_forecasts(models = c("Karlen-pypm", "UMass-MechBayes", "CU-select"),
last_forecast_date = "2021-03-08",
source = "zoltar",
forecast_date_window_size = 6,
locations = "US",
types = c("quantile", "point"),
verbose = FALSE,
targets = paste(1:4, "wk ahead inc death"))
p <- plot_forecasts(fdat,
target_variable = "inc death",
truth_source = "JHU",
intervals = c(.5, .95),
facet = .~model,
fill_by_model = TRUE,
plot=FALSE)
p +
scale_x_date(name=NULL, date_breaks = "1 months", date_labels = "%b") +
theme(axis.ticks.length.x = unit(0.5, "cm"),
axis.text.x = element_text(vjust = 7, hjust = -0.2))
The following code looks at three models’ forecasts of incident deaths at one time point for multiple locations. Note the use of the facet_scales option which is passed to ggplot and allows the y-axes to be on different scales.
fdat <- load_latest_forecasts(models = c("Karlen-pypm", "UMass-MechBayes", "CU-select"),
last_forecast_date = "2021-03-08",
source = "zoltar",
forecast_date_window_size = 6,
locations = c("19", "48", "46"),
types = c("quantile", "point"),
verbose = FALSE,
targets = paste(1:4, "wk ahead inc death"))
p <- plot_forecasts(fdat,
target_variable = "inc death",
intervals = c(.5, .95),
truth_source = "JHU",
facet = location~model,
facet_scales = "free_y",
fill_by_model = TRUE,
plot=FALSE)
p +
scale_x_date(name=NULL, date_breaks = "1 months", date_labels = "%b") +
theme(axis.ticks.length.x = unit(0.5, "cm"),
axis.text.x = element_text(vjust = 7, hjust = -0.2))
The following code looks at three models’ forecasts of incident deaths at one time point for multiple locations. Note the use of the facet_scales option which is passed to ggplot and allows the y-axes to be on different scales.
fdat <- load_forecasts(models = c("Karlen-pypm", "UMass-MechBayes"),
forecast_dates = seq.Date(as.Date("2020-12-13"), as.Date("2021-03-14"), by = "28 days"),
locations = "US",
types = c("quantile", "point"),
targets = paste(1:4, "wk ahead inc death"))
## polling for status change. job_url=https://zoltardata.com/api/job/55188/
## QUEUED
## SUCCESS
p <- plot_forecasts(fdat,
target_variable = "inc death",
truth_source = "JHU",
intervals = c(.5, .95),
facet = .~model,
fill_by_model = TRUE,
plot = FALSE)
p + scale_x_date(name=NULL, date_breaks = "1 months", date_labels = "%b") +
theme(axis.ticks.length.x = unit(0.5, "cm"),
axis.text.x = element_text(vjust = 7, hjust = -0.2))
By default plot_forecasts() uses JHU CSSE data as the “Observed Data” in the above plots. However, users can specify custom “ground truth” data that either they provide themselves or that is loaded in from the package.
Here is an example of a call to plot_forecasts() that simply specifies an alternate truth source, which must be one of “JHU”, “USAFacts”, or “NYTimes”.
plot_forecasts(forecast_data = forecasts,
target_variable = "inc case",
locations = "US",
truth_source = "USAFacts",
intervals = c(.5, .8, .95))
Alternatively, truth data can be loaded in from one of those sources independently and stored in your active R session and passed to the plot_forecasts() function.
truth_data <- load_truth(truth_source = "JHU",
target_variable = "inc case",
locations = "US")
Truth data comes in the following tabular format.
And can be used in conjunction with a call to plot_forecasts
plot_forecasts(forecast_data = forecasts,
target_variable = "inc case",
truth_data = truth_data,
truth_source = "USAFacts",
intervals = c(.5, .8, .95))
In addition to downloading forecasts, covidHubUtils has the capability to evaluate the forecasts based on metrics including the prediction interval coverage at any provided quantiles, the absolute error based on a median estimate, the weighted interval score (WIS) of the forecast, and a component-wise breakdown of WIS into sharpness, overprediction and underprediction.
The inputs to the scored_forecasts() include a dataframe created using load_forecasts() and a dataframe created using truth_data().
The following code creates a dataframe of scored data based on the forecasts and truth data loaded earlier in this vignette in long format. This is the example for the USA models:
inc_case_targets <- paste(1:4, "wk ahead inc case")
truth_data <- load_truth(truth_source = "JHU",
target_variable = "inc death",
locations = "US")
forecasts_multiple <- load_forecasts(models = c("COVIDhub-baseline", "COVIDhub-ensemble"),
forecast_dates = as.Date(c("2020-12-14", "2020-12-15")) + seq(0, 35, 7),
locations = "US",
types = c("point","quantile"),
targets = paste(1:4, "wk ahead inc death"),
source = "zoltar",
verbose = FALSE,
as_of=NULL,
hub = c("US"))
scores <- score_forecasts(forecasts = forecasts_multiple,
return_format = "wide",
truth = truth_data)
datatable(scores,
extensions = 'FixedColumns',
options = list(
dom = 't',
scrollX = TRUE,
fixedColumns = list(leftColumns = 2)
))
This is the example code to run for models in the European Hub:
truth <- load_truth("JHU",hub = c("ECDC","US"), target_variable = "inc death", locations = "GB")
scores_ECDC <- score_forecasts(forecasts = forecasts_ECDC,
return_format = "wide",
truth = truth)
datatable(scores_ECDC,
extensions = 'FixedColumns',
options = list(
dom = 't',
scrollX = TRUE,
fixedColumns = list(leftColumns = 2)
))
The scores calculated using this functionality can be used to compare the accuracy and precision of forecasts across models, locations, horizons, and submission weeks.
You can access the evaluation paper to read an in depth explanation about the methodologies and you can also visit the evaluation reports page.
Scores can be visualized through number of aggregations and stratification and the following segments of code should be a good starting point to get to visualizing the scores for your forecasts.
Heatmaps are a great way to visuazlize your data over a large number of models. The following code provides a skeleton for visualizing the WIS score of multiple models over the 4 horizons:
scores %>%
dplyr::group_by(model, horizon) %>%
dplyr::summarise(wis = mean(wis)) %>%
scoringutils::score_heatmap(metric = "wis",
x = "horizon")
The following snippet of code is a way to visualize the WIS score in the form of a bar plot for multiple models. Each of the bar represents the WIS score of a different model and each bar is divided in the how much of it is underprediction, overprediction and sharpness:
scores %>%
dplyr::group_by(model) %>%
dplyr::summarise(sharpness= mean(sharpness),
overprediction = mean(overprediction),
underprediction = mean(underprediction)) %>%
scoringutils::wis_components()
The following snippet of code provides
scores %>%
tidyr::pivot_longer(cols = dplyr::starts_with("coverage")) %>%
dplyr::rename(range = name, coverage = value) %>%
dplyr::group_by(model, range) %>%
dplyr::summarise(coverage = mean(coverage)) %>%
dplyr::mutate(range = as.numeric(sub("coverage_", "", range))) %>%
scoringutils::interval_coverage()